1. INTRODUCTION

Classification is a machine learning task that has been widely used in recent years to categorize
data into classes [1]-[5]. Classification techniques predict the class of each data instance from a
given set of data fields (features). Using all of the original features may be time-consuming and
may mislead the classification process, so feature selection methods choose a minimal set of
features that leads to better learning accuracy and lower computational cost. Feature selection
methods fall into three categories [6]: i) filter-based methods, which use statistical approaches to
assess the correlation between the features and the class, ii) wrapper-based methods, which assess
a selected feature subset using a machine learning algorithm, and iii) embedded methods, which
combine the advantages of wrapper-based and filter-based methods [6]-[8].

Feature selection is a nondeterministic polynomial (NP)-hard problem because of its high-dimensional
search space [9]-[12], in which exhaustive search is infeasible. An efficient search algorithm is
therefore required to perform the feature selection task. Swarm intelligence is a group of
population-based algorithms inspired by the behavior of social insects and animals; these are called
nature-inspired optimization algorithms [6], [13]-[17].

Several nature-inspired optimization algorithms have been applied to feature selection problems
using the wrapper-based method, because of their simple, natural representation and their efficiency
in global search [6], [7], [18]-[22]. Particle swarm optimization mimics the behavior of birds and
has been used to find the best set of features [8], [23]-[25]. Ant colony optimization simulates the
behavior of ants searching for food and has been used for feature selection problems [26]-[32]. Cat
swarm optimization is inspired by cats searching for their prey [33], [34]. The grey wolf
optimization algorithm is based on how a wolf pack behaves (its hierarchy and hunting) [35], [36].
Moth flame optimization algorithms mimic the behavior of moths seeking light sources [37]-[40].

Yang and Deb [41] presented the cuckoo search (CS) algorithm for continuous optimization problems
[42], [43]. CS is based on the breeding behavior of cuckoo birds. The CS algorithm has been applied
effectively to problems from different domains, such as mobile robot navigation [44] and
reliability-redundancy allocation [45]. CS has some advantages over other nature-inspired
optimization algorithms; for example, it employs a form of elitism. Also, the randomization in CS is
more efficient because the move size is heavy-tailed, so a move of any large size is possible.
Because there are fewer parameters to tune than in genetic algorithms and particle swarm
optimization, CS may be easier to adapt to a broader range of optimization problems [42], [46]. CS
has also been reported to reach global optima and to converge rapidly. A binary CS (BCS) was
proposed for feature selection problems in [46], [47]. However, a limitation of the CS algorithm is
its slow convergence speed [47]. A modified CS algorithm with rough sets was proposed in [48] for
the feature selection problem; in the modified version, some cuckoo species use the obligate brood
parasitic behavior and some birds use the Lévy flight behavior. The dimensionality of the datasets
requires an efficient search algorithm to discover the optimal feature subset for better prediction.
We therefore propose a modified CS algorithm with the great deluge (GD) algorithm as a local search,
to overcome the slow convergence speed of the CS algorithm and to keep it from getting trapped in
local optima.

The rest of this paper is organized as follows. Section 2 presents the cuckoo search algorithm, the
great deluge algorithm, and the proposed approach. Section 3 lists the results with some discussion,
including a comparison of the CS_GD algorithm with other results from the literature. Section 4
states the conclusion and future research.


2. METHOD 

The methodology presented here first describes the original CS algorithm, then the great deluge
algorithm as a local-search algorithm, and finally the CS algorithm with great deluge for the
feature selection problem.


2.1. Cuckoo search algorithm (CS) 

The CS algorithm was first proposed in [41], inspired by the obligate brood parasitic behavior of
cuckoos. In this behavior, the cuckoo lays its eggs in the nest of another, smaller bird (the host).
The cuckoo's eggs normally hatch before the host's eggs, and the cuckoo chick then throws the other
eggs out of the nest. In the CS algorithm, the population is represented by the nests, the solutions
in the population are represented by eggs, and the new solutions produced using Lévy flights are the
cuckoo's eggs. Each new solution is compared with the other solutions, and the worst solutions are
replaced by better ones. The three main rules of the CS algorithm are stated in [41]: i) each cuckoo
lays one egg in a randomly chosen nest, ii) the higher-quality nests are set aside and considered
for further improvement, and iii) the number of nests is predetermined, and for each nest the host
discovers the alien egg with a probability pa between 0 and 1, in which case the host chooses to
abandon the nest or throw the egg away.

The Lévy flight in (1) is used by the CS algorithm to create new solutions.


$$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \mathrm{L\acute{e}vy}(\lambda) \qquad (1)$$


Here $x_i^{(t+1)}$ is the new solution produced for cuckoo $i$, $\alpha > 0$ is the move size, and
$\lambda$ is a constant of the Lévy distribution. Equation (1) describes a random move in which the
next location depends on the current location (the first term on the right-hand side of (1)) and on
the transition probability (the second term); $\oplus$ denotes entry-wise multiplication. The move
size is thus multiplied by a random number drawn from a Lévy distribution.

The random move using Lévy flights explores the search regions efficiently because its move length
can be much longer. The random move length is drawn from a Lévy distribution, as shown in (2) [41].


$$\mathrm{L\acute{e}vy} \sim u = t^{-\lambda}, \quad 1 < \lambda \le 3 \qquad (2)$$
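
The move in (1) can be sketched in a few lines of Python. The source does not say how the Lévy-distributed step is sampled, so this sketch assumes Mantegna's algorithm, a common choice; the function names `levy_step` and `levy_move`, the stability index `beta`, and its default of 1.5 are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length (Mantegna's algorithm, assumed here)."""
    # Standard deviation of the numerator Gaussian in Mantegna's formula.
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    # Mostly small steps, occasionally a very large one.
    return u / abs(v) ** (1 / beta)


def levy_move(x, alpha=1.0, beta=1.5, rng=random):
    """Apply the update in (1) entry by entry: new = current + alpha * step."""
    return [xi + alpha * levy_step(beta, rng) for xi in x]
```

Calling `levy_move([0.0, 0.0, 0.0])` returns a new three-element position; most calls move each entry only a little, while a few produce long jumps, which is the heavy-tailed behavior the text describes.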

The process of the CS algorithm is presented in Figure 1. The algorithm starts by initializing the
population with the given number of host nests. After that, in every iteration a randomly selected
cuckoo (solution) is used to generate a new solution using Lévy flights.


Set Nests: the number of nests
Initialize the Population with Nests
Set CSNiter: the number of iterations
for i = 0 to CSNiter do
    Co = a randomly selected cuckoo
    Co* = levy-flight(Co)            // produce a new solution for Co
    F_Co* = FitnessFunction(Co*)
    RN = a random nest among Nests
    F_RN = FitnessFunction(RN)
    if F_Co* < F_RN
        RN = Co*                     // replace the solution
    endif
    abandon a fraction pa ∈ [0, 1] of the worst nests and
        build new nests in their place using levy-flight
    keep the best nests
endfor
output the best nests


Figure 1. The CS algorithm Pseudo-code [41] 
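
The loop in Figure 1 can be sketched as runnable Python for a continuous minimization problem. This is a minimal sketch under stated assumptions, not the authors' code: the Lévy step uses Mantegna's algorithm, positions are clipped to hypothetical `bounds`, the `sphere` test function is illustrative, and abandoned nests are rebuilt by a Lévy move from their old position.

```python
import math
import random


def levy_step(rng, beta=1.5):
    # Mantegna's algorithm (an assumption; the source names no sampler).
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    return rng.gauss(0.0, sigma_u) / abs(rng.gauss(0.0, 1.0)) ** (1 / beta)


def sphere(x):
    """Illustrative objective to minimize (not from the paper)."""
    return sum(v * v for v in x)


def cuckoo_search(fitness, dim, n_nests=10, n_iter=100, pa=0.3,
                  alpha=1.0, bounds=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds

    def levy_flight(x):
        # Move every coordinate by a Levy step and clip to the bounds.
        return [min(hi, max(lo, v + alpha * levy_step(rng))) for v in x]

    nests = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_nests)]
    scores = [fitness(n) for n in nests]
    n_abandon = int(pa * n_nests)

    for _ in range(n_iter):
        # Produce a new solution from a randomly selected cuckoo.
        cuckoo = nests[rng.randrange(n_nests)]
        candidate = levy_flight(cuckoo)
        cand_score = fitness(candidate)
        # Compare it with a random nest and replace that nest if better.
        j = rng.randrange(n_nests)
        if cand_score < scores[j]:
            nests[j], scores[j] = candidate, cand_score
        # Abandon the worst fraction pa of nests and rebuild them.
        order = sorted(range(n_nests), key=lambda k: scores[k], reverse=True)
        for k in order[:n_abandon]:
            nests[k] = levy_flight(nests[k])
            scores[k] = fitness(nests[k])

    best = min(range(n_nests), key=lambda k: scores[k])
    return nests[best], scores[best]
```

For example, `cuckoo_search(sphere, dim=2)` returns the best nest found and its objective value. Because only the worst nests are abandoned, the best nest survives from one iteration to the next.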


2.2. Great deluge algorithm 

Among the algorithms that are based on the behavior of water [49], the great deluge (GD) algorithm
was first proposed by Dueck in 1993 [50]. GD uses an acceptance criterion to decide whether to
accept neighbor solutions. GD simulates the path of a hill climber in a great deluge who tries to
keep their feet dry. GD accepts neighbor solutions with a worse objective value based on the water
level (a boundary value). The level decreases by the decay rate during the search process, which
pushes the working solution to keep improving until convergence.

The whole process of the GD algorithm is represented in Figure 2. The algorithm starts by
initializing the parameters, and then the iterative process begins. In every iteration, k
neighboring solutions are produced from the input (current) solution Co (the produce-Neighbour step
in Figure 2). A produced neighboring solution is accepted if it is better than the current solution
or if its objective value is less than or equal to the water level (boundary); this condition helps
GD to avoid getting trapped in local optima.


Function GD(Co): Co is the input solution
Initialize K: number of neighborhood solutions
Initialize rain-Speed
Initialize GDIter: number of iterations
water-Level = f(Co)
decay-Rate = f(Co) x rain-Speed / GDIter
i = 0
while (i < GDIter)
    Co* = produce-Neighbour(Co, K)
    if f(Co*) < f(Co) or f(Co*) <= water-Level
        Co = Co*                     // accept the solution
    endif
    water-Level = water-Level - decay-Rate
    i++
endwhile
output Co




Figure 2. Pseudocode of the great deluge algorithm [51] 
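
The acceptance rule of Figure 2 can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' implementation: the source does not say how one candidate is chosen among the K neighbours, so the best of the K is taken here, and the best solution seen is tracked in addition to the current one. The helper `flip_one_bit` is a hypothetical neighbourhood used only for illustration.

```python
import random


def flip_one_bit(bits, rng):
    """Hypothetical neighbourhood: flip one randomly chosen bit."""
    out = list(bits)
    i = rng.randrange(len(out))
    out[i] = 1 - out[i]
    return out


def great_deluge(solution, f, neighbour, gd_iter=100, rain_speed=0.5,
                 k=2, rng=None):
    """Minimize f from `solution` while a water level falls each iteration."""
    rng = rng or random.Random()
    current, current_cost = list(solution), f(solution)
    best, best_cost = current, current_cost
    water_level = current_cost
    decay_rate = current_cost * rain_speed / gd_iter

    for _ in range(gd_iter):
        # Produce k neighbours and keep the best one (an assumption).
        candidate = min((neighbour(current, rng) for _ in range(k)), key=f)
        cand_cost = f(candidate)
        # Accept improvements, or any move at or below the water level.
        if cand_cost < current_cost or cand_cost <= water_level:
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        water_level -= decay_rate

    return best, best_cost
```

As a usage example, `great_deluge([1] * 8, sum, flip_one_bit)` minimizes the number of ones in an 8-bit vector. The defaults `gd_iter=100`, `rain_speed=0.5`, and `k=2` mirror the GD settings the paper lists for its experiments.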


2.3. Cuckoo search with great deluge algorithms

In this section, the CS algorithm is used to select the best subset of features. For the feature
selection problem, a solution can be represented as a binary vector of length N, where N is the
total number of features in the given dataset; 0 indicates that a feature is not chosen, while 1
indicates that it is chosen. The CS algorithm abandons a fraction of the solutions and produces new
ones. At an early stage of the CS process, abandoning a solution may waste time: the solution has
not yet improved, and there may not be enough iterations left to improve its replacement. An
updating strategy is therefore required before the solutions are abandoned, which improves them by
accepting worse neighbor solutions.

The CS algorithm uses Lévy flights to produce new solutions. We propose an updating strategy that
uses the great deluge algorithm with two neighborhood strategies, to keep the CS algorithm from
getting stuck in local optima and to speed up convergence. Figure 3 shows the process of the cuckoo
search algorithm with the great deluge algorithm. The neighborhood strategies can be explained as
follows [52]. Consider the solution Co = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]. The neighborhood strategies
are: i) move neighborhood, which chooses a feature at random and moves it to a new random position,
and ii) swap neighborhood, which chooses two features at random and swaps their values.
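
The two neighborhood strategies can be sketched in Python on the example solution above. The reading of "move" as removing an entry and reinserting it at another index is an assumption, since the source gives no further detail; the function names are illustrative.

```python
import random


def move_neighbour(bits, rng):
    """Take the entry at one random position and reinsert it at another."""
    out = list(bits)
    src = rng.randrange(len(out))
    value = out.pop(src)
    dst = rng.randrange(len(out) + 1)
    out.insert(dst, value)
    return out


def swap_neighbour(bits, rng):
    """Exchange the values at two distinct random positions."""
    out = list(bits)
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(42)
co = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
moved = move_neighbour(co, rng)
swapped = swap_neighbour(co, rng)
```

Under this reading, both moves rearrange which positions are selected but keep the number of selected features unchanged, so in this sketch any change in subset size would have to come from the Lévy flight step.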


[Flowchart, boxes in order:
 1. Initialize the population with Nst host nests; set i = 0 and Niter; set pa ∈ [0, 1]
 2. Evaluate the population
 3. Loop test on i and Niter; when the loop ends: final solution and accuracy
 4. Co = select_random_cuckoo
 5. Co* = GD(Co) (see Figure 2)
 6. Evaluate the fitness of Co*
 7. RN = random_nest() and evaluate its fitness
 8. Compare the fitness values (yes/no branch)
 9. Abandon a fraction pa ∈ [0, 1] of the solutions
10. Build new nests using levy_flights
11. Save the best nests
12. Find the current best nests; i = i + 1; return to the loop test]


Figure 3. The process of CS algorithm with GD 


In feature selection, two objectives should be taken into account to produce a good solution: the
accuracy should be maximized while the number of selected features is minimized. Thus, the
k-nearest neighbor (KNN) classifier [53] is used to produce the mean accuracy under 10-fold
cross-validation [54], with the input features given by the algorithm as a solution. The objective
function (OF) in (3) considers both objectives (maximizing accuracy while reducing the number of
selected features) [53].


$$OF = \alpha E + \beta \frac{S}{N} \qquad (3)$$


where $\alpha$ is a parameter between 0 and 1, $\beta = 1 - \alpha$, $E$ is the error rate given by
the KNN classifier, $S$ is the number of selected features, and $N$ is the total number of features.
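
The objective function (3) is simple enough to compute directly. The sketch below is illustrative: the source does not report the value of the weighting parameter, so the default `alpha=0.99` and the numbers in the example are assumptions, and the error rate is passed in rather than computed by a KNN classifier.

```python
def objective(error_rate, n_selected, n_total, alpha=0.99):
    """Weighted sum of the error rate and the selected-feature ratio, as in (3)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)


# Hypothetical values: 25% error with 9 of 20 features, alpha = 0.9.
example = objective(0.25, 9, 20, alpha=0.9)
```

With these hypothetical values, the result is 0.9 × 0.25 + 0.1 × 9/20 = 0.27. For a fixed error rate, selecting fewer features lowers the objective, which is how the second goal enters the search.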


3. RESULTS AND DISCUSSION 

The performance of this work is tested in this section using 9 UCI datasets that have been used in
several well-established studies. These datasets are presented in Table 1 [55]. The final parameter
settings of the CS algorithm, chosen after experiments with different parameter values, are
presented in Table 2. For the original CS algorithm, the parameters pa, α, and λ were first
initialized based on [56]. The experiments were run on a personal computer with an Intel i5
2.30 GHz processor and 8.0 GB of RAM. The results are reported over 10 runs. The datasets are split
into 80% for training and 20% for testing [55].


Table 1. UCI datasets used


No.  Dataset        Features  Instances
1    German         20        1000
2    Breastcancer   9         699
3    Spect          22        267
4    Krvskp         36        3196
5    Ionosphere     34        351
6    Sonar          60        208
7    Lymphography   18        148
8    Tic-tac-toe    9         958
9    Wdbc           30        569


Table 2. Final parameters settings for CS algorithm


Algorithm  Parameter name                        Value
CS         pa                                    0.3
           α                                     1
           λ                                     1.5
           Size of population                    10
           CS iterations (CSNiter)               100
GD         GD iterations (GDiter)                100
           Number of neighborhood solutions (K)  2
           rain-Speed                            0.5


3.1. Comparison between CS algorithm and CS with local search (CS_GD) 

In this section, the results of the original CS algorithm are evaluated and compared with those of
the proposed approach (CS_GD) to show its effectiveness. Table 3 shows the results, where the best
results for each dataset are represented by bold font. The two algorithms are compared based on the
testing mean accuracy in 10-fold cross-validation, the average accuracy over 10 runs, the number of
selected features, and the time in seconds taken to finish the process.


Table 3. Results comparison between CS algorithm and the proposed CS_GD


              CS                                        CS_GD
Dataset       Accuracy  Average   #Features  Average   Accuracy  Average   #Features  Average
                        accuracy  selected   time (s)            accuracy  selected   time (s)
German        74.0      70.8      7          35.7      75.5      72.5      9          48.8
Breastcancer  92.9      92.9      3          15.3      92.9      92.9      3          28.0
Spect         83.3      75.7      16         10.2      83.3      76.9      11         15.1
Krvskp        96.6      94.7      14         161.6     97.5      95.1      10         232.5
Ionosphere    85.9      82.8      7          10.5      87.3      83.4      6          16.4
Sonar         88.1      76.9      20         9.2       90.5      78.3      11         14.2
Lymphography  80.0      69.7      3          6.8       80.0      72.7      4          10.5
Tic-tac-toe   89.1      89.1      9          16.5      89.1      89.1      9          30.7
Wdbc          92.1      91.0      8          13.1      93.0      90.1      6          23.9

The CS_GD algorithm achieves better accuracy than the CS algorithm on 5 datasets and the same
accuracy on the other 4. According to Table 3, CS_GD also selects fewer features on 5 datasets and
the same number on 2. The average computation times in Table 3 show that the CS_GD algorithm needs
more time to complete the process; nonetheless, this extra time is worthwhile because it produces
better results. The behavior of the CS_GD algorithm on the Lymphography and German datasets is
presented in Figure 4, which plots the value of the objective function (OF) in (3); the x-axis
represents the number of iterations and the y-axis represents the objective function value.


Figure 4. The convergence behavior of the current solution of CS and CS_GD


With the great deluge algorithm, a solution is accepted in some cases based on the water level, so
a worse solution is accepted in some iterations. This helps the algorithm to escape from local
optima and to reach better solutions that improve the objective function; GD also speeds up
convergence. Boxplots of the accuracies produced by the CS and CS_GD algorithms are shown in
Figures 5 and 6 to investigate the reliability and stability of the findings. In each box, the
middle line represents the median, while the top and bottom lines represent the maximum and minimum
values, respectively. The box plots for the Breastcancer and Tic-tac-toe datasets show the middle,
top, and bottom lines as a single line, which means that all runs produced the same result. The
other box plots show that the spread between the maximum and minimum values is small and acceptable
for most of the datasets, which indicates the reliability and stability of the results.


Figure 5. Boxplots of CS algorithm for all datasets

Figure 6. Boxplots of CS_GD algorithm for all datasets


3.2. Comparison between CS_GD and other nature-inspired algorithms 

Three nature-inspired algorithms, the bat algorithm (BAT), particle swarm optimization (PSO), and
the firefly optimization algorithm (FFO), are compared with the superior algorithm from the previous
section (CS_GD). The comparison is presented in Table 4, based on the average accuracy over 10
independent runs and the number of selected features. As shown in Table 4, CS_GD achieves the
highest average accuracy on 7 of the 9 datasets, and on 2 of these it is tied with another
algorithm: on the Lymphography dataset CS_GD and BAT have the same average accuracy, and on the
Tic-tac-toe dataset CS_GD and FFO have the same average accuracy. Figure 7 presents the results
visually as a column chart to show the differences between the algorithms.


Table 4. Results of CS_GD, PSO, BAT and FFO algorithms


              CS_GD                PSO                  BAT                  FFO
Dataset       Average   #Features  Average   #Features  Average   #Features  Average   #Features
              accuracy  selected   accuracy  selected   accuracy  selected   accuracy  selected
German        72.5      9          68.5      16         70.1      8          71.5      9
Breastcancer  92.9      3          95.6      4          95.1      2          93.1      3
Spect         76.9      11         73.9      24         74.6      7          74.1      18
Krvskp        95.1      10         89.3      21         62.8      8          94.4      15
Ionosphere    83.4      6          80.4      18         82.5      10         81.3      10
Sonar         78.3      11         78.1      40         74.8      16         76.4      27
Lymphography  72.7      4          71.7      9          72.7      7          68.3      4
Tic-tac-toe   89.1      9          81.5      5          66.5      1          89.1      9
Wdbc          90.1      6          92.1      17         90.3      8          91.1      10

Figure 7. The comparison between CS_GD and other nature-inspired algorithms


The t-test is used to assess the significance of the findings by comparing the means of two groups.
Table 5 provides the p-values obtained by applying the t-test to the average accuracy results of
CS_GD and each of the other algorithms. These tests show that most of the observed differences are
significant (p-value < 0.05): against FFO, on 5 of the 8 datasets with a reported p-value, and
against PSO, on 6 of the 9 datasets.


Table 5. T-test of CS_GD with other algorithms


Dataset       PSO       BAT       FFO
German        0.0070    0.0493    0.0120
Breastcancer  4.82E-05  0.0024    0.0839
Spect         0.2232    0.2799    0.0070
Krvskp        0.0072    4.18E-09  0.1527
Ionosphere    0.0188    0.2300    0.0204
Sonar         0.4504    0.1452    0.0239
Lymphography  0.3610    0.5000    0.0136
Tic-tac-toe   0.0009    3.81E-08  --
Wdbc          0.0072    0.4181    0.0805
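
The source does not state which variant of the t-test was applied. As one possibility, the sketch below computes Welch's two-sample t statistic and its approximate degrees of freedom using only the Python standard library; the choice of Welch's test and the function name `welch_t` are assumptions, and the samples in the example are made-up numbers, not results from this study.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Sample variances (n - 1 in the denominator).
    va = statistics.variance(sample_a)
    vb = statistics.variance(sample_b)
    se2 = va / na + vb / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Made-up samples chosen so the arithmetic is easy to follow.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

For these made-up samples, the means are 3 and 4 and both variances are 2.5, so t = -1 with 8 degrees of freedom. A p-value would then be read from the t distribution, for example with a statistics library.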


3.3. Comparison of CS_GD algorithm with other results from the literature

For the nine datasets studied in this work, this section compares the best results achieved by the
CS_GD method with the best-known solutions from the literature. Table 6 compares the CS_GD
algorithm with the best-known findings of other algorithms from the literature. Accuracy is
considered the primary goal when assessing the algorithms' performance, so Table 6 reports the
highest accuracy. As seen in Table 6, the CS_GD algorithm outperformed the literature results on 5
of the 9 datasets in terms of accuracy and has comparable results on the others.


Table 6. Results of CS_GD Algorithm compared with some literature results


Dataset        CS_GD  Literature results  Taken from
German         75.5   81.50               [24]
Breast cancer  92.9   98.00               [57]
Spect          83.3   82.60               [57]
Krvskp         97.5   96.80               [57]
Ionosphere     87.3   79.8                [58]
Sonar          90.5   86.70               [59]
Lymphography   80.0   85.30               [60]
Tic-tac-toe    89.1   80.80               [57]
WDBC           93.0   97.00               [24]


4. CONCLUSION AND FUTURE WORK 

In this work, a CS algorithm hybridized with the GD algorithm was introduced for the feature
selection problem. Two objectives were considered: minimizing the number of selected features and
maximizing the prediction accuracy as far as possible. Achieving these objectives requires an
effective algorithm that can find an optimal solution to the problem. The CS algorithm needs to be
modified to improve its convergence speed and to produce good results, so the GD algorithm is
proposed to enhance the solutions and to give CS faster convergence, owing to the ability of GD to
accept some worse moves in order to find better solutions. The efficiency of the algorithm was
demonstrated on nine UCI datasets: the proposed method finds good solutions compared with some
other nature-inspired algorithms and other comparable state-of-the-art methods. Our future work is
to find an automatic parameter tuning approach to set the parameters used and to investigate the
proposed algorithm in other domains.